Implementing a BNC-Compare-able Web Corpus

Author

  • William H. Fletcher
Abstract

This paper details the author’s plans for, and progress in, compiling and analyzing a new gigaword English corpus from the web to complement his BNC-based online database “Phrases in English”. The new corpus represents the principal English-speaking countries in proportion to their populations and will be linguistically annotated with the CLAWS4 tagger, using a PoS tagset comparable to those of the BNC and ANC. Parallel processing on multiple PCs will facilitate reaching the targeted size. The corpus will continue to grow dynamically in response to actual user queries to the author’s various web-as-corpus interfaces, but “snapshots” of each generation of the corpus will be preserved to ensure replicability of results. This report on work in progress is intended to inspire discussion of the underlying concepts and suggestions for improvement.


Similar resources

The Creation of a Spoken Sub-Corpus from the British National Corpus for Comparative Purposes

The British National Corpus (henceforth BNC) is one of the most frequently consulted corpora in linguistic research. While the use of this corpus is continuously on the increase, it appears that most BNC-related research work has exploited the corpus in its entirety, i.e. taking the corpus as a whole in analysing specific features or comparing with a different reference corpus. Despite the fact...


Comparing Knowledge Sources for Nominal Anaphora Resolution

We compare two ways of obtaining lexical knowledge for antecedent selection in other-anaphora and definite noun phrase coreference. Specifically, we compare an algorithm that relies on links encoded in the manually created lexical hierarchy WordNet and an algorithm that mines corpora by means of shallow lexico-semantic patterns. As corpora we use the British National Corpus (BNC), as well as th...


Web as Corpus

The corpus resource for the 1990s was the BNC. Conceived in the 80s, completed in the mid 90s, it was hugely innovative and opened up myriad new research avenues for comparing different text types, sociolinguistics, empirical NLP, language teaching and lexicography. But now the web is with us, giving access to colossal quantities of text, of any number of varieties, at the click of a button, fo...


Corpus Linguistics with BNCweb - a Practical Guide

Book synopsis This book presents a richly illustrated, hands-on discussion of one of the fastest growing fields in linguistics today. The authors address key methodological issues in corpus linguistics, such as collocations, keywords and the categorization of concordance lines. They show how these topics can be explored step-by-step with BNCweb, a user-friendly web-based tool that supports soph...


Clarifying the Concepts and Navigating a Path through the BNC Jungle

In this paper, an attempt is first made to clarify and tease apart the somewhat confusing terms genre, register, text type, domain, sublanguage, and style. The use of these terms by various linguists and literary theorists working under different traditions or orientations will be examined and a possible way of synthesising their insights will be proposed and illustrated with reference to the d...



Publication date: 2007